Mmdetection: [fp16 training error] CUDA error: device-side assert triggered

Created on 2 Jul 2019 · 23 Comments · Source: open-mmlab/mmdetection

Checklist

[O ] I have searched related issues but could not get the expected help.
[O ] The bug has not been fixed in the latest version.

Describe the bug
A clear and concise description of what the bug is.
If there are any related issues or upstream bugs, please also refer to them.

Error traceback

What command or script did you run?

I ran the following command to train mask_rcnn_r50_fpn_fp16:
==============================================
NUM_GPUS=4
CONFIG='mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py'
WORK_DIR='work_dirs/mask_rcnn_r50_fpn_fp16_1x'

tools/dist_train.sh $CONFIG $NUM_GPUS --validate --work_dir $WORK_DIR
==============================================

If applicable, paste the error traceback here using code blocks.

Because it is too long, I will paste it at the end.

Reproduction details

Did you make any modifications on the code? Did you understand what you have modified?
No

What dataset did you use?
COCO

Environment

OS: Ubuntu 16.04.4
GCC 5.4.0
PyTorch version 1.1.0
- How you installed PyTorch : conda (inside docker)
- GPU model : V100 32GB (NVLink)
- CUDA and CUDNN version : CUDA 9.0 , cuDNN 7

When I try to train the fp16 model, I get:
CUDA error: device-side assert triggered
(insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)

followed by many repetitions of this message:
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [80,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

When I comment out the fp16 configuration, the error does not occur.
https://github.com/open-mmlab/mmdetection/blob/master/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py#L2
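
For reference, the repeated assertion requires every index to satisfy `-size <= index < size` along the indexed dimension. Below is a minimal host-side sketch of the same condition using NumPy. The helper name `out_of_bounds` is hypothetical and is not part of PyTorch or mmdetection; it only illustrates how offending index values could be found before they reach the GPU.

```python
import numpy as np


def out_of_bounds(indices, size):
    """Return the index values that violate -size <= index < size.

    Hypothetical helper that mirrors the condition in the assertion
    message; it is not a PyTorch or mmdetection function.
    """
    indices = np.asarray(indices, dtype=np.int64)
    bad = (indices < -size) | (indices >= size)
    return indices[bad]


# Example: indexing a dimension of length 5.
bad_values = out_of_bounds([0, 4, -5, 5, -6], size=5)
print(bad_values)  # the values 5 and -6 fall outside [-5, 5)
```

CUDA reports kernel errors asynchronously, so the Python line shown in the traceback is not necessarily the operation that failed. Running with the environment variable `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, which usually gives a more accurate traceback.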

Error message
/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x
Directory exists
loading annotations into memory...
2019-07-02 00:56:29,056 - INFO - Distributed training: True
2019-07-02 00:56:29,549 - INFO - load model from: modelzoo://resnet50
loading annotations into memory...
2019-07-02 00:56:29,828 - WARNING - unexpected key in source state_dict: fc.weight, fc.bias

missing keys in source state_dict: layer3.0.bn1.num_batches_tracked, layer2.0.bn3.num_batches_tracked, layer2.2.bn1.num_batches_tracked, layer2.1.bn2.num_batches_tracked, layer1.0.bn2.num_batches_tracked, layer3.0.downsample.1.num_batches_tracked, layer4.1.bn2.num_batches_tracked, layer3.5.bn2.num_batches_tracked, layer3.1.bn2.num_batches_tracked, layer3.4.bn1.num_batches_tracked, layer1.2.bn2.num_batches_tracked, layer3.2.bn2.num_batches_tracked, layer3.1.bn3.num_batches_tracked, layer4.2.bn2.num_batches_tracked, layer2.0.bn2.num_batches_tracked, layer2.3.bn1.num_batches_tracked, layer4.2.bn3.num_batches_tracked, layer3.4.bn3.num_batches_tracked, layer3.2.bn3.num_batches_tracked, layer1.0.downsample.1.num_batches_tracked, layer2.1.bn1.num_batches_tracked, layer3.3.bn3.num_batches_tracked, layer4.0.downsample.1.num_batches_tracked, layer4.0.bn1.num_batches_tracked, layer4.0.bn3.num_batches_tracked, layer1.1.bn2.num_batches_tracked, layer3.0.bn3.num_batches_tracked, layer3.2.bn1.num_batches_tracked, layer3.0.bn2.num_batches_tracked, layer4.0.bn2.num_batches_tracked, layer2.2.bn2.num_batches_tracked, layer3.5.bn3.num_batches_tracked, layer1.0.bn1.num_batches_tracked, layer2.3.bn3.num_batches_tracked, layer1.0.bn3.num_batches_tracked, layer3.3.bn2.num_batches_tracked, layer4.1.bn1.num_batches_tracked, layer1.1.bn3.num_batches_tracked, layer2.3.bn2.num_batches_tracked, layer3.3.bn1.num_batches_tracked, layer3.1.bn1.num_batches_tracked, layer3.5.bn1.num_batches_tracked, layer2.0.downsample.1.num_batches_tracked, layer1.1.bn1.num_batches_tracked, layer3.4.bn2.num_batches_tracked, bn1.num_batches_tracked, layer1.2.bn1.num_batches_tracked, layer4.2.bn1.num_batches_tracked, layer2.0.bn1.num_batches_tracked, layer4.1.bn3.num_batches_tracked, layer2.1.bn3.num_batches_tracked, layer1.2.bn3.num_batches_tracked, layer2.2.bn3.num_batches_tracked

loading annotations into memory...
loading annotations into memory...
Done (t=12.76s)
creating index...
Done (t=12.47s)
creating index...
index created!
Done (t=12.82s)
creating index...
index created!
index created!
Done (t=13.82s)
creating index...
index created!
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
loading annotations into memory...
Done (t=1.77s)
creating index...
index created!
Done (t=2.36s)
creating index...
Done (t=2.39s)
creating index...
index created!
index created!
Done (t=2.53s)
creating index...
index created!
2019-07-02 00:56:53,981 - INFO - Start running, host: root@b6940c72ef4f, work_dir: /home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x
2019-07-02 00:56:53,981 - INFO - workflow: [('train', 1)], max: 12 epochs
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [96,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [97,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [5,0,0], thread: [98,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
... omitted...
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [126,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/tmp/pip-req-build-fl_vaj2n/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda ->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
  File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 98, in <module>
    main()
  File "/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py", line 94, in main
    logger=logger)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 60, in train_detector
    _dist_train(model, dataset, cfg, validate=validate)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 189, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 356, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/runner/runner.py", line 262, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/apis/train.py", line 40, in batch_processor
    losses = model(**data)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 50, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 75, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/base.py", line 86, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/detectors/two_stage.py", line 114, in forward_train
    proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/fp16/decorators.py", line 152, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 221, in get_bboxes
    scale_factor, cfg, rescale)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 83, in get_bboxes_single
    self.target_stds, img_shape)
  File "/home/user/Desktop/workspace_zacurr/mmdetection/mmdet/core/bbox/transforms.py", line 40, in delta2bbox
    means = deltas.new_tensor(means).repeat(1, deltas.size(1) // 4)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator > const&) + 0x6a (0x7fe9d572d66a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x140e0 (0x7fe9cf61b0e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7fe9d571b661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fe9d4d160ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: + 0x1333fb (0x7fe9ed5f13fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x352ae4 (0x7fe9ed810ae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x352b41 (0x7fe9ed810b41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x19dbbc (0x5575e53ecbbc in /opt/conda/bin/python)
frame #8: + 0xf32a8 (0x5575e53422a8 in /opt/conda/bin/python)
frame #9: + 0xf343a (0x5575e534243a in /opt/conda/bin/python)
frame #10: + 0xf2c77 (0x5575e5341c77 in /opt/conda/bin/python)
frame #11: + 0xf2b07 (0x5575e5341b07 in /opt/conda/bin/python)
frame #12: + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #13: + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #14: + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #15: + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #16: + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #17: + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #18: + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #19: + 0xf2b1d (0x5575e5341b1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x5575e5387d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x5575e539084f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x5575e53f6b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x5575e5461961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x5575e546beb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x5575e5333b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fea04b54830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: + 0x1c61a8 (0x5575e54151a8 in /opt/conda/bin/python)

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator > const&) + 0x6a (0x7f6c09f7766a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x140e0 (0x7f6c03e650e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7f6c09f65661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7f6c095600ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: + 0x1333fb (0x7f6c21e3b3fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x352ae4 (0x7f6c2205aae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x352b41 (0x7f6c2205ab41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x19dbbc (0x56145130ebbc in /opt/conda/bin/python)
frame #8: + 0xf32a8 (0x5614512642a8 in /opt/conda/bin/python)
frame #9: + 0xf343a (0x56145126443a in /opt/conda/bin/python)
frame #10: + 0xf2c77 (0x561451263c77 in /opt/conda/bin/python)
frame #11: + 0xf2b07 (0x561451263b07 in /opt/conda/bin/python)
frame #12: + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #13: + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #14: + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #15: + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #16: + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #17: + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #18: + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #19: + 0xf2b1d (0x561451263b1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x5614512a9d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x5614512b284f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x561451318b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x561451383961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x56145138deb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x561451255b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7f6c3939e830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: + 0x1c61a8 (0x5614513371a8 in /opt/conda/bin/python)

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator > const&) + 0x6a (0x7fa03010666a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x140e0 (0x7fa029ff40e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7fa0300f4661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7fa02f6ef0ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: + 0x1333fb (0x7fa047fca3fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x352ae4 (0x7fa0481e9ae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x352b41 (0x7fa0481e9b41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x19dbbc (0x564f6c1dabbc in /opt/conda/bin/python)
frame #8: + 0xf32a8 (0x564f6c1302a8 in /opt/conda/bin/python)
frame #9: + 0xf343a (0x564f6c13043a in /opt/conda/bin/python)
frame #10: + 0xf2c77 (0x564f6c12fc77 in /opt/conda/bin/python)
frame #11: + 0xf2b07 (0x564f6c12fb07 in /opt/conda/bin/python)
frame #12: + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #13: + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #14: + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #15: + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #16: + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #17: + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #18: + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #19: + 0xf2b1d (0x564f6c12fb1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x564f6c175d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x564f6c17e84f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x564f6c1e4b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x564f6c24f961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x564f6c259eb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x564f6c121b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fa05f52d830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: + 0x1c61a8 (0x564f6c2031a8 in /opt/conda/bin/python)

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at ../c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator > const&) + 0x6a (0x7f23624b566a in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x140e0 (0x7f235c3a30e0 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x61 (0x7f23624a3661 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: torch::autograd::Variable::Impl::release_resources() + 0x5e (0x7f2361a9e0ae in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: + 0x1333fb (0x7f237a3793fb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x352ae4 (0x7f237a598ae4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x352b41 (0x7f237a598b41 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x19dbbc (0x55f393649bbc in /opt/conda/bin/python)
frame #8: + 0xf32a8 (0x55f39359f2a8 in /opt/conda/bin/python)
frame #9: + 0xf343a (0x55f39359f43a in /opt/conda/bin/python)
frame #10: + 0xf2c77 (0x55f39359ec77 in /opt/conda/bin/python)
frame #11: + 0xf2b07 (0x55f39359eb07 in /opt/conda/bin/python)
frame #12: + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #13: + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #14: + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #15: + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #16: + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #17: + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #18: + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #19: + 0xf2b1d (0x55f39359eb1d in /opt/conda/bin/python)
frame #20: PyDict_SetItem + 0x3da (0x55f3935e4d4a in /opt/conda/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x55f3935ed84f in /opt/conda/bin/python)
frame #22: PyImport_Cleanup + 0x99 (0x55f393653b79 in /opt/conda/bin/python)
frame #23: Py_FinalizeEx + 0x61 (0x55f3936be961 in /opt/conda/bin/python)
frame #24: Py_Main + 0x355 (0x55f3936c8eb5 in /opt/conda/bin/python)
frame #25: main + 0xee (0x55f393590b4e in /opt/conda/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7f23918dc830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: + 0x1c61a8 (0x55f3936721a8 in /opt/conda/bin/python)

Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
__main__, mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in
main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/home/user/Desktop/workspace_zacurr/mmdetection/tools/train.py', '--local_rank=0', '/home/user/Desktop/workspace_zacurr/mmdetection/configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py', '--launcher', 'pytorch', '--validate', '--work_dir', '/home/user/Desktop/workspace_zacurr/mmdetection/work_dirs/mask_rcnn_r50_fpn_fp16_1x']' died with .

Source

zacurr

Most helpful comment

@gittigxuy Sorry for the late response! I have just solved the problem. It was caused by a mismatch among the numbers of gt_bboxes, gt_labels and gt_masks: when applying the crop operation, I filtered out the bboxes that fell outside the cropping range, but forgot to filter the corresponding gt_labels and gt_masks.
So I guess your problem has the same cause?
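
A minimal sketch of a sanity check for this kind of mismatch, meant to be called at the end of a custom augmentation step. The function name `check_gt_consistency` is hypothetical (it is not an mmdetection API), and the sketch assumes the annotations are sequences or NumPy arrays with one entry per object.

```python
import numpy as np


def check_gt_consistency(gt_bboxes, gt_labels, gt_masks=None):
    """Raise ValueError if the per-image annotations disagree in length.

    Hypothetical helper, not part of mmdetection. Returns the number
    of objects when everything is consistent.
    """
    n_boxes = len(gt_bboxes)
    if len(gt_labels) != n_boxes:
        raise ValueError(
            f"{n_boxes} gt_bboxes but {len(gt_labels)} gt_labels")
    if gt_masks is not None and len(gt_masks) != n_boxes:
        raise ValueError(
            f"{n_boxes} gt_bboxes but {len(gt_masks)} gt_masks")
    return n_boxes


# Two boxes with two labels pass the check.
boxes = np.zeros((2, 4), dtype=np.float32)
labels = np.array([1, 2])
print(check_gt_consistency(boxes, labels))
```

Failing early on the CPU with a readable message is usually easier to debug than a device-side assert raised later inside a CUDA kernel.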

BlakeXiaochu on 1 Aug 2019

❤1 👍1

All 23 comments

I just ran the command

./tools/dist_train.sh configs/fp16/mask_rcnn_r50_fpn_fp16_1x.py 4

But I did not get this error. Does this error appear every time?

yhcao6 on 2 Jul 2019

Yes. There is only a trivial difference between the command I used and yours,
and under the fp32 (default) setting there is no error.
I will try this on another server (CUDA 10) when its GPUs are not busy, maybe one week from now.

zacurr on 3 Jul 2019

https://blog.csdn.net/sinat_29957455/article/details/95493564

guaiwuguba on 19 Jul 2019

@zacurr, when I add a random_scale function to extra_aug.py, I encounter the same problem. I guess the reason is that my bboxes are out of range. Am I right? When I remove the random_scale function, the model trains normally.

gittigxuy on 23 Jul 2019

@yhcao6, when will you release the data pipeline to master? I am waiting for this part. I am writing my own functions: the rotate function is finished, but the scale part hits the error above. What should I do?

gittigxuy on 23 Jul 2019

The data pipeline will not add extra operations such as rotation. There may be some problems in your code. Could you give a minimal example to reproduce this error?

yhcao6 on 23 Jul 2019

@yhcao6, the code is based on the Paperspace repository (https://github.com/Paperspace/DataAugmentationForObjectDetection). I have checked it and found no problem in the data after augmentation, but when I use random_shift and random_scale it does not work.

gittigxuy on 24 Jul 2019

@yhcao6, I found the same problem in a PyTorch issue (https://github.com/pytorch/pytorch/issues/21136). What should I do to fix this bug?

gittigxuy on 30 Jul 2019

Could you give me a minimal example to reproduce the bug, so that I can check whether there is something wrong in your code? Or maybe there is a bug in this repo.

yhcao6 on 30 Jul 2019

I have sent the code to your Gmail and am waiting for your reply. Thanks.

gittigxuy on 30 Jul 2019

@gittigxuy Have you fixed your problem? I hit the same one when training on my custom dataset.

BlakeXiaochu on 1 Aug 2019

No. Did you change any other code, or did you get this error just by training on your own data? I get this error after adding some data augmentation functions; if I do not change the code, training runs normally.

gittigxuy on 1 Aug 2019

Yes, I changed the code to apply it to text detection. I converted the labels into COCO format and used the original CocoDataset, and no error occurred. But when I modified the code to add random scale and random crop, the error appeared.

BlakeXiaochu on 1 Aug 2019

Same problem. I am waiting for the author to deal with it; I have sent the code to him.

gittigxuy on 1 Aug 2019

Thx, if I fix it, I will tell you.

BlakeXiaochu on 1 Aug 2019

Which data augmentation did you add? Could I add you on QQ or WeChat? I only added random_rotate and it works fine.

gittigxuy on 1 Aug 2019

BlakeXiaochu on 1 Aug 2019

❤1 👍1

Thanks, maybe I have the same problem. Could you share your augmentation code with me? My email is [email protected]

gittigxuy on 1 Aug 2019

No problem @gittigxuy

BlakeXiaochu on 1 Aug 2019

I have sent the code to your Gmail and am waiting for your reply. Thanks.

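import numpy as np

def clip_box(bbox, clip_box, alpha):
    # bbox: (N, >=4) array of [x1, y1, x2, y2, ...]; clip_box: [x1, y1, x2, y2].
    # bbox_area is a helper from the same code and is not shown here.
    ar_ = (bbox_area(bbox))
    # Clamp each box to the clipping region.
    x_min = np.maximum(bbox[:, 0], clip_box[0]).reshape(-1, 1)
    y_min = np.maximum(bbox[:, 1], clip_box[1]).reshape(-1, 1)
    x_max = np.minimum(bbox[:, 2], clip_box[2]).reshape(-1, 1)
    y_max = np.minimum(bbox[:, 3], clip_box[3]).reshape(-1, 1)

    bbox = np.hstack((x_min, y_min, x_max, y_max, bbox[:, 4:]))

    # Fraction of each box's area lost to clipping.
    delta_area = ((ar_ - bbox_area(bbox)) / ar_)

    # Keep only boxes that lost less than (1 - alpha) of their area.
    mask = (delta_area < (1 - alpha)).astype(int)

    # Rows are dropped here, so the number of boxes can shrink.
    bbox = bbox[mask == 1, :]

    return bbox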

This is the clip_box function in your code, which may delete some gt boxes. However, you forgot to delete the corresponding gt labels.
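
One possible way to keep boxes and labels in sync is to return the keep mask together with the clipped boxes and apply it to every per-object array. The sketch below is an illustration under assumptions, not the code from the thread: `clip_box_with_keep` is a hypothetical name, and `bbox_area` is re-implemented here because the original helper is not shown (this version counts inverted boxes as zero area, so boxes that lie entirely outside the region are dropped).

```python
import numpy as np


def bbox_area(bbox):
    """Area of [x1, y1, x2, y2, ...] boxes; inverted boxes count as zero.

    Assumed helper: the original bbox_area is not shown in the thread.
    """
    width = np.clip(bbox[:, 2] - bbox[:, 0], 0, None)
    height = np.clip(bbox[:, 3] - bbox[:, 1], 0, None)
    return width * height


def clip_box_with_keep(bbox, clip_region, alpha):
    """Clip boxes to clip_region and report which input rows were kept.

    Assumes every input box has a positive area.
    """
    area_before = bbox_area(bbox)
    x_min = np.maximum(bbox[:, 0], clip_region[0]).reshape(-1, 1)
    y_min = np.maximum(bbox[:, 1], clip_region[1]).reshape(-1, 1)
    x_max = np.minimum(bbox[:, 2], clip_region[2]).reshape(-1, 1)
    y_max = np.minimum(bbox[:, 3], clip_region[3]).reshape(-1, 1)
    clipped = np.hstack((x_min, y_min, x_max, y_max, bbox[:, 4:]))

    lost_fraction = (area_before - bbox_area(clipped)) / area_before
    keep = lost_fraction < (1 - alpha)  # boolean mask over the input rows
    return clipped[keep], keep


gt_bboxes = np.array([[10.0, 10.0, 50.0, 50.0], [90.0, 90.0, 200.0, 200.0]])
gt_labels = np.array([3, 7])

gt_bboxes, keep = clip_box_with_keep(gt_bboxes, [0, 0, 100, 100], alpha=0.25)
gt_labels = gt_labels[keep]  # apply the same mask to the labels
assert len(gt_bboxes) == len(gt_labels)
```

In this example the second box loses most of its area to clipping and is dropped, and its label is dropped with it. If gt_masks are stored as an array with one entry per object, the same `keep` mask can be applied to them.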

yhcao6 on 2 Aug 2019

👍1

If I change my code as you suggest but still get the same problem, what should I do?

gittigxuy on 13 Aug 2019

I have the same problem. After I added some data to the dataset, I get this error:
`RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1565287025495/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f5083808e37 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x12e14 (0x7f5083a40e14 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x165bf (0x7f5083a445bf in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x74 (0x7f50837f3fa4 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: + 0x140fc34 (0x7f50868b8c34 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #5: + 0x31a4bf0 (0x7f508864dbf0 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #6: + 0x3756d12 (0x7f5088bffd12 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #7: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f5088bffdbf in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #8: + 0x37739b1 (0x7f5088c1c9b1 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #9: c10::TensorImpl::release_resources() + 0x20 (0x7f50837f3f50 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #10: + 0x1bb014 (0x7f50aece0014 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x40142b (0x7f50aef2642b in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x401461 (0x7f50aef26461 in /home/titan-ubuntu/miniconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

frame #28: __libc_start_main + 0xf0 (0x7f50bdd19830 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
`
These are my annotations:
VOC20202020_000001.jpgThe VOC2020 DatabasePASCAL VOC2020flickr05003753

I have already wasted 3 days and still can't solve the problem. Can anybody help me? Thank you very much.

SunNYNO1 on 14 Jan 2020

Maybe you can print the labels to make sure their maximum value is consistent with num_classes.
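
A minimal sketch of such a check, assuming the labels are integers that index directly into `num_classes` outputs. The helper name `invalid_labels` is hypothetical, and whether the background class counts toward `num_classes` depends on the mmdetection version, so the valid range may need adjusting.

```python
import numpy as np


def invalid_labels(labels, num_classes):
    """Return the distinct label values outside [0, num_classes).

    Hypothetical helper, not part of mmdetection.
    """
    labels = np.asarray(labels)
    bad = (labels < 0) | (labels >= num_classes)
    return np.unique(labels[bad])


# Example: a dataset configured with num_classes=21 that contains a label 21.
labels = [1, 2, 21, 20, 21]
print("max label:", max(labels))
print("invalid labels:", invalid_labels(labels, num_classes=21))
```

An empty result means every label is in range; any value it returns would be out of bounds when used as a class index on the GPU.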

guaiwuguba on 15 Jan 2020
